This paper addresses video inter-intra similarity retrieval for pornographic classification.
The main approach is to obtain the internal representation of a single unlabeled video and its
external similarity to batches of labeled videos, then combine the two to determine its label.
For the internal representation, we extract features from the frames and cluster them to find the
representative centroid as the intra-feature. For the external similarity, we use a video similarity
learning network named ViSiL to calculate the similarity score between two videos using chamfer
similarity. From the scores between the input video and batches of pornographic/non-pornographic
videos, the inter-feature of the input video is obtained. Finally, the inter-similarity vector and
the intra-representation are concatenated and fed to a final classifier to identify whether the
video is for adults or not. In experiments, our method achieves 96.88% accuracy on NPDI-2k, a result
comparable to other state-of-the-art methods on the pornographic classification problem.
1. INTRODUCTION

With the popularity of internet-based video sharing services, the quantity and diversity of images
and videos on the World Wide Web have reached unprecedented scales, making them difficult to categorize.
Moreover, with the increase of unrestricted sexual websites, filtering adult content and preventing it
from reaching minors has become essential. In recent years, many efforts have been made to
distinguish pornographic visual content from normal content. Existing studies on pornographic visual
recognition and classification can be divided into four categories by approach: skin-based,
handcrafted feature-based, deep learning-based, and object-based.

The skin-based approach recognizes the ratio of exposed skin to decide whether there is a nude
person in an image. To improve prediction performance, shape and color features can be combined and
evaluated against mathematical and statistical thresholds. Furthermore, higher-level features such
as the localization of the face or body parts can be used to strengthen the accuracy of skin ratio
estimation. Skin-based methods include adapting various color spaces to different body areas [1],
combining skin detection and face localization [2], or using a pre-trained discrimination model
followed by skin extraction [3]. However, because these methods rely on the assumption of exposed
skin rather than on visual understanding [4], they suffer from a high rate of wrong predictions,
which significantly reduces their performance.

The handcrafted feature-based approach extracts visual features from images sharing the same label
and maps them into a dictionary codebook. Then, a machine learning classifier is used to identify
pornographic elements based on the codebook’s features. To describe pornographic features in images,
various feature descriptors or extractors are used, such as scale-invariant feature transform
(SIFT)/Hue-SIFT [5], BossaNova [6], or temporal robust features [7], a spatio-temporal detector using
a Fisher Vector feature representation. Despite their effectiveness in identifying pornographic
content, the complexity and diversity of pornographic content, as well as the omission of spatial
relationships, make it difficult to determine appropriate features to describe visual pornography.

Deep convolutional neural networks (CNNs) have been an effective method for tackling pornography
recognition problems, with state-of-the-art performance [8]-[13]. Rather than selecting appropriate
features manually, deep neural network models extract features and refine learning parameters
automatically, thus improving model performance. These studies often take deep CNN models pre-trained
on large-scale datasets such as ImageNet or common objects in context (COCO) and fine-tune them on a
small custom dataset for a specific task, in this case visual pornography detection. However, the
sensitivity of this research topic prevents these studies from publishing their datasets widely,
which makes it difficult to train, evaluate, and compare deep learning models on the same tasks.

Recent pornography detection studies [14]-[16] focus on detecting sexual objects and organs within
images or video frames, then determining the sensitivity of the input visual content. While the choice
of sexual objects depends heavily on the scale and perspective of each study, these studies adapt an
existing object detector and fine-tune it on a custom labeled dataset so that it recognizes sexual
objects properly. Notably, Tabone et al. proposed a seven-class scheme for images comprising
buttocks, female breast, female genitals (divided into two sub-classes: female genital posing and
female genital active), male genitals, sex toys, and a non-porn (normal) class. The authors annotated
each object with a set of five labeled points: one center point and four perpendicularly offset
points. However, the biggest challenge of this method is the lack of data, as there is not yet any
large-scale visual dataset of sexual organs for training an effective detection model. While
object-based approaches can ensure the right prediction in most cases, the strong resemblance between
sexual objects and common items in some special cases or viewpoints (such as a dildo and a sausage)
makes it difficult to make the right prediction. In our previous object-based studies on identifying
sexual objects and organs [18]-[20], we labeled four sexual organs (male and female genitals, female
breast, and anus) with polygon masks for both object detection and instance segmentation tasks. With
the labeled dataset, we not only developed a sexual object detector based on Mask R-CNN but also used
a two-step training strategy that helps the detector overcome false positive predictions on sexual
objects, thus enhancing the performance of recognizing and classifying pornographic content.

Previous studies on pornographic video recognition mostly experiment on the Núcleo de Processamento
Digital de Imagens (NPDI) pornography datasets [6], [7]. However, these methods predominantly fed the
key-frames provided by NPDI’s authors to their models rather than learning a representation over the
whole video, which limited the models’ performance. In our previous experiments on pornography videos
[18]-[20], we extracted key frames throughout the whole NPDI videos instead of using the provided
key-frames, as we believe this yields better precision.

In this paper, we propose a method that calculates and combines inter- and intra-similarity features
between videos to recognize pornography. We consider this approach to be the first of its kind in the
area of pornographic recognition. The input video is fed to an anabranch inter-intra feature stream.
The ‘intra branch’ obtains an appropriate inner representation of the video along the temporal axis at
the frame level, while the ‘inter branch’ calculates the similarities of the input with a set of
videos. Then, we combine the features from the two branches and feed them to a classifier to determine
the pornographic label. Evaluating our method on the NPDI-2k dataset, we achieved a competitive result
of 96.88% accuracy.


2. METHOD 
A video is usually defined as a sequence of frames connected along a temporal dimension. Thus, the
basic approach is to split a video into frames and work with them for action recognition [21],
searching [22], or information retrieval [23], [24]. Our methodology rests on the belief that if there
is a strong resemblance between an unlabeled video a and a set of videos S sharing the same label l,
then there is a high probability that video a has the same label l. However, the similarity, no matter
how strong, does not fully reflect the true nature of the relationship between a and S, as some
internal features of the video also affect that relationship. Therefore, we propose an approach that
leverages both the inner features of the video and its outer relationships with other videos to
determine the video’s main characteristic, in particular its eroticism.

Figure 1 depicts the overview of our proposed method. The figure portrays the structure of an
anabranch river, which leverages the advantages of both inter-feature learning and intra-feature
learning. Overall, the input video is brought to the ‘inter branch’ for similarity scoring and to the
‘intra branch’ for feature extraction. The outputs of these branches are then concatenated to generate
the joint representation of the input video. Finally, the representation is fed to a classifier for
video classification. Our approach is described in detail as follows.
Figure 1. Proposed method’s pipeline. To select clusters to represent the video’s intra-feature, we either
choose the largest cluster’s centroid (method 1) or concatenate all the centroids together (method 2)


2.1. Video representation 
2.1.1. Intra-feature branch 

Initially, the input video is fed to the intra-feature branch and split into frames. To extract the
internal representation of a video, our main approach is to find similar frames across the temporal
dimension and cluster them to choose the feature. Although we share the same idea as [21], rather than
training a deep CNN model to learn how to cluster frames based on Hamming distance, we use an
unsupervised clustering algorithm to reduce the computational effort and thus the time complexity.

More specifically, a feature extractor is used to retrieve the representative vector of each frame.
We treat all the extracted vectors as data points and apply the K-means clustering algorithm to
cluster them, thereby obtaining an appropriate representation of the input video. Given the resulting
clusters and their centroids, we consider two methods for obtaining the representation vector V_inner:
i) select the largest cluster’s centroid and take its feature vector, as in (1); ii) take all the
centroids’ vectors and concatenate them together, as in (2). The largest cluster is defined as the
cluster containing the largest number of data points (in this case, the frames’ feature vectors), and
a cluster’s centroid is the mean of all data points within that cluster.


$$V_{inner} = \mathrm{Centroid}_{\max}(\mathrm{Cluster}_1, \ldots, \mathrm{Cluster}_n) \tag{1}$$

$$V_{inner} = \mathrm{concat}(\mathrm{Centroid}_1, \ldots, \mathrm{Centroid}_n) \tag{2}$$
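
The two ways of deriving the intra-feature from clustered frame vectors can be illustrated with a minimal sketch. It is written in plain Python with no external libraries and is not the authors' implementation: the function names (`kmeans`, `intra_feature`), the farthest-first initialization, and the toy 2-D "frame features" are all assumptions made for illustration. A practical pipeline would more likely rely on an optimized K-means library and on 2,048- or 1,024-dimensional CNN features.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Deterministic farthest-first initialization (an assumption, not from the paper)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means; returns (centroids, cluster index of each point)."""
    centroids = init_centroids(points, k)
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assign) if a == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:
            break
        assign = new_assign
    return centroids, assign


def intra_feature(frame_features, k, method):
    """Method 1: centroid of the largest cluster. Method 2: all centroids concatenated."""
    centroids, assign = kmeans(frame_features, k)
    if method == 1:
        sizes = [assign.count(c) for c in range(k)]
        return centroids[sizes.index(max(sizes))]
    return [value for centroid in centroids for value in centroid]


# Toy frame features: four frames near the origin, two frames far away.
frames = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11]]
v_inner_method1 = intra_feature(frames, k=2, method=1)
v_inner_method2 = intra_feature(frames, k=2, method=2)
```

In this toy case the largest cluster holds the four frames near the origin, so method 1 returns that cluster's mean, while method 2 returns both centroids joined into one vector of twice the length.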
2.1.2. Inter-feature branch

In the inter-feature branch, ViSiL is used to calculate the spatio-temporal relations between
a pair of videos. The main approach of ViSiL is to estimate the pairwise frame similarity between
videos by applying TensorDot and mean-max filter chamfer similarity (CS) to the region-level frame
features. The frame-similarity matrix is then fed to a four-layer CNN followed by CS again to obtain
the spatio-temporal similarity vector and score between the videos. Chamfer similarity, the similarity
counterpart of chamfer matching [25], is calculated by averaging, over each item in set x, the
similarity of its most similar item in set y to determine the closeness score. The difference between
the similarity vectors and scores for pairs of relevant and irrelevant videos is presented in Figure 2.


[Figure 2: frame-to-frame similarity matrices and ViSiL video-to-video similarity scores for a pair of
relevant videos and a pair of irrelevant videos]

Figure 2. ViSiL spatio-temporal similarity scores


For the frame-to-frame similarity, given two video frames a and b, the region feature maps are
extracted and decomposed into region vectors a_{ij} and b_{kl}, with i, j, k, l ∈ [1, N]. Then, CS is
used to calculate the similarity:

$$CS_{frame}(a, b) = \frac{1}{N^2} \sum_{i,j=1}^{N} \max_{k,l \in [1,N]} a_{ij}^{\top} b_{kl} \tag{3}$$

After that, a frame-similarity matrix comprising the pairwise frame similarities is fed to a
four-layer CNN. Finally, the element-wise hard tanh activation function and CS are applied to the CNN
output to obtain the similarity score between a pair of videos x, y:

$$CS_{video}(x, y) = \frac{1}{X'} \sum_{i=1}^{X'} \max_{j \in [1,Y']} \mathrm{Htanh}\left(S^{xy}(i, j)\right) \tag{4}$$

where $S^{xy} \in \mathbb{R}^{X' \times Y'}$ indicates the output of the CNN and Htanh is the hard tanh function.
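
To make the mean-max structure of chamfer similarity concrete, the following is a minimal plain-Python sketch of the frame-level and video-level scores. It is an illustration under stated assumptions rather than the ViSiL implementation: region vectors are passed as flat lists, the matrix given to `chamfer_video` merely stands in for the output of the four-layer CNN (which is not modeled here), and all function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def chamfer_frame(regions_a, regions_b):
    """Frame-level CS: for every region vector of frame a, take its best
    dot-product match among the regions of frame b, then average."""
    best = [max(dot(ra, rb) for rb in regions_b) for ra in regions_a]
    return sum(best) / len(best)


def hard_tanh(value):
    """Hard tanh: clip the value to the interval [-1, 1]."""
    return max(-1.0, min(1.0, value))


def chamfer_video(sim_matrix):
    """Video-level CS: clip each entry with hard tanh, take the maximum of
    each row, then average over the rows."""
    row_max = [max(hard_tanh(v) for v in row) for row in sim_matrix]
    return sum(row_max) / len(row_max)


# Toy example: two regions per frame, 2-D region vectors.
regions_a = [[1.0, 0.0], [0.0, 1.0]]
regions_b = [[1.0, 0.0], [0.6, 0.8]]
frame_score = chamfer_frame(regions_a, regions_b)   # (1.0 + 0.8) / 2

# Toy stand-in for the CNN output between two short videos.
cnn_output = [[0.2, 1.7], [-0.5, 0.3]]
video_score = chamfer_video(cnn_output)              # (1.0 + 0.3) / 2
```

Note that the measure is asymmetric: the maximum is taken over the second argument and the mean over the first, so swapping the two inputs can change the score.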

In our proposed approach, the input video is fed to ViSiL to be compared with all N videos from the
training set, of which 50% are pornographic and 50% are non-pornographic. All the similarity scores
are then concatenated into an external feature V_outer of dimension N:

$$V_{outer} = \mathrm{concat}(S_{(x,1)}, S_{(x,2)}, \ldots, S_{(x,N)}) \tag{5}$$

where $S_{(x,i)}$ is the similarity score between the input video x and the i-th video.

2.1.3. Joint representation

The joint representation vector is calculated by concatenating the intra and inter features as in (6).
The outcome is an (x + y)-dimensional vector, where x and y here denote the dimensions of the intra
and inter features, respectively. The details of x and y are discussed in the experimental section:

$$V_{joint} = \mathrm{concat}(V_{inner}, V_{outer}) \tag{6}$$

Finally, the concatenated inter and intra representation is fed to the final classifier to
determine whether the input video is pornographic or not.
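
The construction of the outer feature and the joint vector amounts to scoring the input against every reference video and appending the scores to the intra-feature. The sketch below shows this in plain Python. The similarity function `toy_similarity` is a hypothetical stand-in for ViSiL (videos are reduced to single numbers purely for illustration), and the function names are assumptions.

```python
def outer_feature(similarity_fn, query, references):
    """One similarity score per reference video, in reference order."""
    return [similarity_fn(query, ref) for ref in references]


def joint_representation(v_inner, v_outer):
    """Concatenate the intra-feature and the inter-feature into one vector."""
    return list(v_inner) + list(v_outer)


def toy_similarity(a, b):
    """Hypothetical placeholder for a video similarity network."""
    return 1.0 - abs(a - b)


# Toy data: each "video" is a single number; three reference videos.
references = [0.0, 0.5, 1.0]
v_outer = outer_feature(toy_similarity, 0.5, references)
v_inner = [0.1, 0.2]
v_joint = joint_representation(v_inner, v_outer)
```

With the dimensions reported in the experimental section (a 2,048-dimensional ResNet101 inner feature and a 1,000-dimensional outer feature), the same concatenation would give a 2,048 + 1,000 = 3,048-dimensional joint vector.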
2.2. Final classification model

For the final classification model, we use two classifiers, a multi-layer perceptron (MLP) and a
support vector machine (SVM), to determine the status of the input video from the joint
representation. The MLP is a three-layer neural network whose layer dimensions are (x, 256),
(256, 32), and (32, 1), respectively, where x is the size of the input representation. The first two
layers use leaky ReLU as the activation function, each followed by a normalization layer, while the
last layer uses sigmoid for binary classification, as shown in Figure 3.


[Figure 3: Dense (256, LeakyReLU) -> Dense (32, LeakyReLU) -> Dense (1, Sigmoid)]

Figure 3. The 3-layer MLP architecture
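
A forward pass through an MLP of this shape can be sketched in plain Python as follows. Several details are assumptions, since the text does not specify them: the normalization layer is taken to be a parameter-free layer normalization, the leaky ReLU slope is set to 0.01, the weights are random rather than trained, and a toy input size of 16 replaces the real joint-representation size. All function names are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer; weights holds one row of len(x) per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    """Leaky ReLU with an assumed negative slope of 0.01."""
    return [v if v > 0 else slope * v for v in x]


def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization (assumed type of normalization)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights in a small uniform range, zero biases (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(x, layers):
    """(x, 256) -> (256, 32) -> (32, 1), with sigmoid on the single output."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = layer_norm(leaky_relu(dense(x, w1, b1)))
    h = layer_norm(leaky_relu(dense(h, w2, b2)))
    return sigmoid(dense(h, w3, b3)[0])


rng = random.Random(0)
input_dim = 16  # toy size; the real input is the joint representation
layers = [init_layer(input_dim, 256, rng),
          init_layer(256, 32, rng),
          init_layer(32, 1, rng)]
prob = mlp_forward([rng.random() for _ in range(input_dim)], layers)
```

The output `prob` lies strictly between 0 and 1 and would be thresholded (for example at 0.5, itself an assumption) to obtain the binary label.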


3. EXPERIMENTAL RESULTS 

In the experiments, we trained and evaluated our model using the NPDI-2k dataset [7], which
contains 1,000 pornographic videos and 1,000 non-pornographic videos. The NPDI-2k videos range from
several seconds to approximately thirty minutes, with frame rates from 15 to 25 FPS. For evaluation,
two-fold cross-validation is applied five times, similar to the experimental scenario used in [7],
and the final outcome is the mean of the five performances. For each cross-validation phase, the
videos are divided equally in two, so that the training and testing sets each contain 1,000 videos
(500 porn and 500 non-porn). To reduce the computational cost, rather than using every frame of each
video, we extracted only one frame per second for both similarity calculation and intra-feature extraction.
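
The evaluation protocol and the frame sampling can be sketched as follows, in plain Python. This is an illustrative reading of the protocol, not the authors' code: the function names, the fixed random seed, and the choice to yield each fold once as training set and once as test set are assumptions, and how the individual scores are grouped before averaging is left open here.

```python
import random


def five_by_two_splits(labels, repeats=5, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated two-fold
    cross-validation, keeping the class balance equal in both folds."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    for _ in range(repeats):
        fold_a, fold_b = [], []
        for cls in classes:
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            half = len(idx) // 2
            fold_a += idx[:half]
            fold_b += idx[half:]
        yield fold_a, fold_b  # train on fold A, test on fold B
        yield fold_b, fold_a  # then swap the roles


def frame_indices_1fps(n_frames, fps):
    """Indices of the frames kept when sampling one frame per second."""
    return list(range(0, n_frames, int(round(fps))))


# 1,000 pornographic (label 1) and 1,000 non-pornographic (label 0) videos.
labels = [1] * 1000 + [0] * 1000
splits = list(five_by_two_splits(labels))
kept = frame_indices_1fps(75, 25)  # a three-second clip at 25 FPS
```

Each training and test set in `splits` contains 1,000 videos with 500 of each class, matching the description above.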

During testing, we used ResNet101 and DenseNet121, both pre-trained on ImageNet, for feature
extraction in the intra branch. The feature obtained for each frame has a size of 2,048 or 1,024,
respectively. Then, applying K-means clustering to these feature vectors, we experimented only with
2 and 3 clusters to ensure the quality of the video’s inner representation. We selected at most 3
clusters not only to maintain the performance and computational cost of our method but also because
the minimum number of frames we could extract from a single video is 3 (equivalent to a three-second
video). The inner feature vector shares the same dimension as the corresponding frame feature,
depending on the feature extractor used. On the other hand, the outer feature created by calculating
similarity scores through the inter branch is a 1,000-dimensional vector.

For the final classifier, the MLP model was trained on Google Colab Pro with a P100 GPU for 800
epochs, with a learning rate of 0.005 and a batch size of 32. The SVM model was evaluated with four
kernels: linear, polynomial, radial basis function (RBF), and sigmoid. Since the predicted outcome is
a binary label, we expected the linear kernel to give the highest performance among the four.


Table 1. Overall results

Inner representative method¹      Intra-feature   Classification   Accuracy (%) with   Accuracy (%) with
                                  extractor       model            2 clusters²         3 clusters²

Choosing the largest cluster’s    ResNet101       MLP              96.76               96.76
centroid (method 1)                               SVM³             95.04               95.68
                                  DenseNet121     MLP              96.50               96.46
                                                  SVM³             95.72               95.22

Concatenating all the clusters’   ResNet101       MLP              96.80               96.88
centroids (method 2)                              SVM³             95.52               95.56
                                  DenseNet121     MLP              96.74               96.50
                                                  SVM³             95.78               95.62

Acc: Accuracy; MLP: Multi-layer Perceptron; SVM: Support Vector Machine
¹ Method to determine the inner representation after clustering frame features
² Model performance with the corresponding number of inner-feature clusters
³ All the results that use the SVM model only use the linear kernel
Overall, frame feature retrieval with ResNet101 achieves better results than its DenseNet121
counterpart when the MLP classifier is used, while concatenating all the clusters’ centroids, after
clustering the frame features with K-means, gives higher accuracy than choosing only the largest one
in most configurations (Table 1). Also, with ResNet101 features, clustering into 3 groups gives
accuracy equal to or higher than clustering into 2. In the final classification model, our results
with the MLP are higher than with the SVM. As we expected, the linear kernel gives the highest
performance among the SVM kernels (Table 2). The highest result of our approach is 96.88% accuracy,
obtained with ResNet101 feature extraction, 3-group clustering, concatenation of all centroids for
the intra-representation, and the MLP classifier. Compared with other methods on the NPDI-2k dataset,
our method achieves a competitive result, with the second-highest accuracy (Table 3).


Table 2. SVM’s kernels performance comparison

Kernel     Accuracy (%)
Linear     95.56
RBF        91.40
Poly       61.00
Sigmoid    94.28

Configuration: ResNet101 feature extraction; SVM classifier; 3-group clustering; concatenation of
all centroids


Table 3. Comparison results on the NPDI-2k dataset

Method*                                          Accuracy (%)
Open-NSFW + Mask R-CNN                           90.40
PROS + MFCC + HOG + TRoF                         90.75
1-Tiered adult detector [9]                      91.50
Space-Time Interest Points [7]**                 94.52
VGG-16 + Bi-RNN                                  95.33
Dense TRoF [7]**                                 95.58
Dense Trajectories                               95.80
Two-stream CNN Late Fusion                       96.40
Inter-intra Joint Representation (Our Method)    96.88
AttM-CNN-Porn                                    97.10

* All experiment results are organized by publication years.
** Results from the paper that proposed the NPDI-2k dataset


4. CONCLUSION AND FUTURE WORK 

In this paper, we proposed a novel approach to identifying pornographic videos that computes a joint
representation of a video’s internal and external features. The intra-features of a video are obtained
by extracting frame-level features with a pre-trained deep learning model and clustering them, while
the inter-similarity between pairs of videos is calculated using mean-max filter chamfer similarity
via a spatio-temporal video similarity architecture named ViSiL. The inner and outer features are then
concatenated, and the joint representation vector is fed to a classifier for video discrimination. In
experiments on the NPDI-2k dataset, our approach demonstrates performance competitive with recent
results, with 96.88% prediction accuracy. We hope our approach can be an initial step toward better
methods for video detection and classification, especially for pornographic classification.

However, there is still work to be done. Computing the inter representation consumes a large
amount of resources. Therefore, our priority is to reduce the computational cost while maintaining
performance. Moreover, recent state-of-the-art methods could be used in place of the MLP and SVM to
improve the effectiveness of the final classifier, so that better performance can be achieved.